I’m participating in this open data science course and I’m now supposed to start my own project, work on it in RStudio, communicate it using R Markdown and share it via GitHub. Let’s see how frightened I will be of this “new technology”.
Check out my GitHub!!
The following has nothing to do with the actual course work. I’m just testing out stuff.
I first made the mistake of placing the data file that I wanted to read in into a separate folder inside the IODS-project folder. When I then tried to read in the file, my code wasn’t able to find it. After some googling I found out that this happens because the R console and R Markdown use separate, independent working directories. You can read about this issue here.
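If the data file has to live in a subfolder, one workaround (a sketch only; the path below is just this project's root and may need adjusting) is to point knitr at the project root in the Rmd setup chunk, so that chunks and the console resolve relative paths the same way:

```r
# In the Rmd setup chunk: make all chunks resolve paths from the
# project root instead of the .Rmd file's own folder
knitr::opts_knit$set(root.dir = "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project")
```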
So I start off by checking that I’m in the right working directory:
getwd()
## [1] "C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project"
Everything’s in order. So now let’s read in the data that consists of my money expenditure during the past year.
money <- read.csv("Sirkesjam.csv", header = TRUE)
#let's see the first few rows
head(money)
## date amount
## 1 2018-10-26 10.40
## 2 2018-10-25 8.84
## 3 2018-10-25 7.78
## 4 2018-10-24 0.20
## 5 2018-10-23 60.00
## 6 2018-10-23 3.50
Let’s plot the data to see whether there are any nice (or worrying) patterns to see.
require(ggplot2)
ggplot(data=money,aes(date, amount))+geom_line()
For some reason the plot looks different in R compared to the knitted outcome shown above. I will get back to this later.
Here’s how the plot looks in R:
My idea is to study whether I use more money on weekends or during holiday seasons. I could imagine that this is the case BUT, normally during weekends and holidays I go out hiking somewhere in the forests where there are fewer opportunities to use money, so I might be wrong.
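A minimal sketch of how that comparison could start, on a toy stand-in for the expenditure data (the real file isn't shown in full here). Converting the date column to a proper Date is also a likely fix for the plot looking different when knitted, since read.csv() imports dates as text:

```r
# toy stand-in for the expenditure data
money <- data.frame(
  date   = c("2018-10-26", "2018-10-25", "2018-10-21", "2018-10-20"),
  amount = c(10.40, 8.84, 60.00, 3.50)
)
# convert to Date so ggplot treats the x axis as continuous time
money$date <- as.Date(money$date)
# flag weekends; %u gives the weekday as 1-7 (Mon = 1), locale-independent
money$weekend <- format(money$date, "%u") %in% c("6", "7")
# mean spending by day type
aggregate(amount ~ weekend, data = money, FUN = mean)
```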
This week I’ve learned about data wrangling and linear regression analysis. The basis of all data science is efficient data wrangling. One has to be able to modify, filter, convert etc. the raw data to an appropriate format to be able to perform any statistical analysis with it.
I started my “statistical journey” from linear regression. This analysis finds out whether there are any statistically significant relationships between dependent and explanatory variables. Depending on the number of explanatory variables the process is called either simple linear regression or multiple linear regression. I fitted a multiple linear regression model, interpreted the results and checked the validity of my model.
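Before diving into the course data, the difference between the two can be illustrated on R's built-in mtcars data:

```r
# simple linear regression: one explanatory variable
simple <- lm(mpg ~ wt, data = mtcars)
# multiple linear regression: several explanatory variables
multiple <- lm(mpg ~ wt + hp + qsec, data = mtcars)
coef(simple)    # intercept + one slope
coef(multiple)  # intercept + three slopes
```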
Here’s how I did it..
Let’s start by reading in the data and exploring the structure and dimensions of the data:
#data can be read in with the read.csv command; header = T (TRUE) means the first row holds the column names
learning2014<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/learning2014.csv",header = T)
#str-command lists all the variables in the data, their class and first few observations
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#dim-command lists the number of rows and columns in the data
dim(learning2014)
## [1] 166 7
The data comes from a questionnaire survey that measured students’ learning while studying statistics. The data consists of 166 observations (individuals) on 7 variables. Here’s a list explaining the variables:
Age = Age (in years) derived from the date of birth
Attitude = Global attitude toward statistics
Points = Exam points
gender = Gender: M (Male), F (Female)
Deep = Deep approach to learning (on a scale from 1 to 5)
Surf = Surface approach to learning (on a scale from 1 to 5)
Stra = Strategic approach to learning (on a scale from 1 to 5)
You can read more about the study variables here.
Let’s explore the variation in the variables:
#let's take a summary of all the variables
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We can see that most of the students who participated in the study are female and in their early twenties (median age 22). The three learning approach variables (deep, stra and surf) and the exam points all have fairly symmetric distributions.
Then let’s visualize how different variables affect the points that students are scoring:
# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.4
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
-Looks like students with higher (better) attitude are scoring higher points. There’s maybe a slight gender division but the main deduction is the same for both genders.
-Deep learning approach doesn’t seem to influence points scored.
-Strategic learning approach seems to have a slight positive effect on points scored.
-Surface learning approach seems to have a slight negative effect on points scored.
-Older students seem to be scoring fewer points.
In summary, it looks like attitude, strategic and surface learning approaches might affect points most strongly. But to be sure, we want to test it statistically.
Let’s fit a linear model where points are explained by attitude, stra and surf:
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the model summary we can see that the variables stra and surf do not have a statistically significant influence on points, because their p-values (Pr(>|t|)) are greater than 0.05. This means that the points the students get from the exam do not depend on their strategic or surface learning approach.
Attitude, however, has a strong, statistically significant influence on points: its p-value is far below 0.05.
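The p-values can also be pulled out of the summary object directly, which is handy when checking several models. A small sketch on mtcars (stand-in data, not the course data):

```r
fit <- lm(mpg ~ wt + qsec, data = mtcars)
# the coefficient table is a matrix; its last column holds the p-values
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
pvals < 0.05   # which terms are significant at the 5 % level
```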
Let’s explore which variables we can drop as unnecessary. Probably surf but can we also drop stra?
In the following I use AIC values for model selection. You can read more about them here.
# the step command drops each variable one at a time and computes an AIC value; the model with the lowest AIC is preferred
step(my_model2)
## Start: AIC=557.4
## points ~ attitude + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 15.00 4559.4 555.95
## <none> 4544.4 557.40
## - stra 1 69.61 4614.0 557.93
## - attitude 1 980.95 5525.3 587.85
##
## Step: AIC=555.95
## points ~ attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## - stra 1 81.74 4641.1 556.90
## - attitude 1 1051.89 5611.3 588.41
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra
## 8.9729 3.4658 0.9137
#the drop1 command does pretty much the same thing but performs only a single round of deletions: it removes each term from the full model one at a time and reports the resulting AIC, without refitting further
drop1(my_model2)
## Single term deletions
##
## Model:
## points ~ attitude + stra + surf
## Df Sum of Sq RSS AIC
## <none> 4544.4 557.40
## attitude 1 980.95 5525.3 587.85
## stra 1 69.61 4614.0 557.93
## surf 1 15.00 4559.4 555.95
These analyses suggest that we should drop ‘surf’ from the model as a non-influential variable but keep ‘stra’ in the game. However, because the AIC difference to the next-best model is less than 2, it is usually recommended to choose the simpler model.
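The same comparison can also be done directly with AIC(); a sketch on mtcars (stand-in data), applying the same ~2-unit rule of thumb:

```r
m_full   <- lm(mpg ~ wt + qsec + drat, data = mtcars)
m_simple <- lm(mpg ~ wt + qsec, data = mtcars)
AIC(m_full, m_simple)        # smaller is better
AIC(m_simple) - AIC(m_full)  # if within ~2 units, prefer the simpler model
```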
Therefore, let’s fit a new model with only attitude:
#new model with only attitude as an explanatory variable
my_model3 <- lm(points ~ attitude, data = learning2014)
#summary of the new model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summary of the final model we first find the distribution of residuals, which describes the differences between the observed values and the values fitted by the model.
Next there is a coefficient table. A regression coefficient represents the mean change in the response variable (points) for one unit of change in the predictor variable (here, attitude) while holding the other predictors in the model constant.
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect).
From these results I deduce that attitude has a strong positive effect.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.
In our case the R-squared value is around 20 % which seems low but in some fields, it is entirely expected that R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than physical processes.
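R-squared can be read off the summary object; a quick illustration on mtcars (stand-in data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
# proportion of the variance in mpg explained by wt
r2 <- summary(fit)$r.squared
r2   # about 0.75 for this classic example
```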
The F-statistic on the last line tells whether the regression as a whole performs ‘better than random’: any set of random predictors will show some relationship with the response, so the test checks whether my model fits better than would be expected if none of my predictors had a real relationship with the response. This p-value is below 0.05, so my model makes sense.
Then it’s time for model validation. I made certain assumptions about the data when I decided to use a linear regression model in my analysis. Now I have to test that these assumptions hold and are not violated.
Let’s draw some diagnostic plots. These things are easier to interpret visually.
par(mfrow = c(2,2))
plot(my_model3, which = c(1, 2, 5))
The first plot is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers. If the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly.
In this case everything seems to be OK. There are a few observations (indicated with an id number) that have lower values, but in my opinion they are not clear outliers.
The second plot is basically a Q-Q plot which is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
In this case it is a normal Q-Q plot, which checks the assumption that the residuals are normally distributed. This assumption seems to hold.
The last plot helps me find influential cases, if there are any. Not all outliers are influential in linear regression analysis. Cases that fall outside the Cook’s distance (marked with a red dashed line) are influential: the regression results would be altered if one excluded them.
In this case there are no influential cases; the red dashed line outside of which they would fall is not even visible in this plot.
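The same check can also be done numerically with cooks.distance(); a sketch on mtcars (stand-in data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(fit)
# a common rule of thumb flags cases above 4/n;
# values near or above 1 would be seriously influential
names(which(cd > 4 / length(cd)))
max(cd)
```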
My model seems to be a valid one. It explains around 20 % of the variation in the points that students scored in an exam. The most important factor affecting positively the points was attitude.
So check your attitude!!
This week I learned about logistic regression..
Let’s start by reading in the data and exploring the structure and dimensions of the data:
#read in the csv
alc<-read.csv("C:/Users/Sirke Piirainen/Documents/GitHub/IODS-project/data/alc.csv",header = T)
#print the names of the columns
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades (G1, G2 and G3) and demographic, social and school-related features, collected from school reports and questionnaires. Two datasets were provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). These datasets have been combined so that numerical variables are averaged and categorical values are taken directly from the Mathematics dataset. Workday and weekend alcohol consumption have been combined into an average and categorized as high use (yes/no). Check the variable details here.
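The alcohol-use columns were derived roughly like this (sketched on a toy data frame; the actual wrangling script lives elsewhere, so the cutoff of 2 is the course convention rather than something shown above):

```r
# toy stand-in with the two weekly consumption columns (scale 1-5)
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of workday and weekend use
toy$high_use <- toy$alc_use > 2            # "high use" above the midpoint 2
toy
```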
The data consists of 35 variables and 382 observations.
Next I want to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables in the data and for each of them, I present a hypothesis about their relationships with alcohol consumption.
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) I assume students with very bad relationships have high alcohol consumption.
absences - number of school absences (numeric: from 0 to 93) I assume students with lots of absences have high alcohol consumption.
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) I assume that students with low weekly study time have high alcohol consumption.
higher - wants to take higher education (binary: yes or no) I assume students who don’t want to take higher education have high alcohol consumption.
#get some packages
library(tidyr); library(dplyr); library(ggplot2)
#select only the data that I am interested in
choose<-c("high_use","famrel","absences","studytime","higher")
alc2<-select(alc,one_of(choose))
#summary table of the data
summary(alc2)
## high_use famrel absences studytime higher
## Mode :logical Min. :1.000 Min. : 0.0 Min. :1.000 no : 18
## FALSE:268 1st Qu.:4.000 1st Qu.: 1.0 1st Qu.:1.000 yes:364
## TRUE :114 Median :4.000 Median : 3.0 Median :2.000
## Mean :3.937 Mean : 4.5 Mean :2.037
## 3rd Qu.:5.000 3rd Qu.: 6.0 3rd Qu.:2.000
## Max. :5.000 Max. :45.0 Max. :4.000
#draw barplots of variables to study their distribution
gather(alc2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
None of the variables seem to be evenly distributed.
library(GGally)
# create a plot matrix with ggpairs()
p <- ggpairs(alc2, mapping = aes(col=high_use,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
-There are more students that don’t have a high use of alcohol than those that do.
-There are more students that want to take higher education than those who don’t want to.
-There are many students with very few absences, some with a few more, and then again many students with lots of absences.
-Family relations seem to be fairly good for most of the students.
-Most of the students study 2-5 hours weekly. Only few students study a lot.
Looks like all the hypotheses that I made hold some truth, although I guess I might not have enough data from all observed categories to verify my assumptions.
#fit a logistic regression model
m <- glm(high_use ~ famrel + absences+studytime+higher-1, data = alc2, family = "binomial")
#plot a summary table
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + absences + studytime + higher -
## 1, family = "binomial", data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3321 -0.8215 -0.6473 1.1762 2.1735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## famrel -0.23369 0.12639 -1.849 0.064466 .
## absences 0.07753 0.02273 3.411 0.000647 ***
## studytime -0.52003 0.15754 -3.301 0.000963 ***
## higherno 1.13706 0.73724 1.542 0.122996
## higheryes 0.67440 0.60457 1.115 0.264638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.56 on 382 degrees of freedom
## Residual deviance: 429.40 on 377 degrees of freedom
## AIC: 439.4
##
## Number of Fisher Scoring iterations: 4
From the summary table I can see that the relationship:
-between high use and famrel is not significant, but students with better (higher) family relations are less likely to be in the high-use category.
-between absences and high use is highly significant: students with many absences are more likely to be in the high-use category.
-between study time and high use is also highly significant: students who spend more time studying are less likely to be in the high-use category.
-between higher education and high use is not significant, but students with no willingness to pursue higher education are more likely to be in the high-use category.
I study the factor variable ‘higher’ a bit more. To check whether ‘higher’ improves the model fit, I fit one model with the variable (my.mod1) and one without it (my.mod2) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of higher are zero:
my.mod1 <- glm(high_use ~ famrel + absences+studytime+higher-1, data = alc2, family = "binomial")
my.mod2 <- glm(high_use ~ famrel + absences+studytime-1, data = alc2, family = "binomial")
anova(my.mod1, my.mod2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ famrel + absences + studytime + higher - 1
## Model 2: high_use ~ famrel + absences + studytime - 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 377 429.40
## 2 379 431.77 -2 -2.3695 0.3058
This test tells me that having the variable ‘higher’ in the model doesn’t improve it so I can drop it. I fit a new model without it:
#the new model again
m2 <- glm(high_use ~ famrel + absences+studytime, data = alc2, family = "binomial")
#summary table of the new model
summary(m2)
##
## Call:
## glm(formula = high_use ~ famrel + absences + studytime, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1549 -0.8286 -0.6531 1.1925 2.1845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.75052 0.59833 1.254 0.209713
## famrel -0.23533 0.12630 -1.863 0.062431 .
## absences 0.07773 0.02252 3.452 0.000557 ***
## studytime -0.54411 0.15551 -3.499 0.000467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 430.19 on 378 degrees of freedom
## AIC: 438.19
##
## Number of Fisher Scoring iterations: 4
Next I’ll try dropping the variable ‘famrel’ because it doesn’t seem to be significant (p-value greater than 0.05). I fit a model without ‘famrel’.
m3<-glm(formula = high_use ~ absences + studytime, family = "binomial",
data = alc2)
summary(m3)
##
## Call:
## glm(formula = high_use ~ absences + studytime, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2128 -0.8387 -0.7046 1.1996 2.1832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16640 0.33954 -0.490 0.624087
## absences 0.08054 0.02285 3.524 0.000425 ***
## studytime -0.55015 0.15550 -3.538 0.000403 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 433.65 on 379 degrees of freedom
## AIC: 439.65
##
## Number of Fisher Scoring iterations: 4
The model without ‘famrel’ has a higher AIC value than the model including it. When comparing AIC values one should choose the model with the lowest AIC. But if the difference between the best models is less than 2 units, it is advisable to choose the simpler one. Based on this advice I drop ‘famrel’, and my final model includes only absences and studytime as explanatory variables.
Next I express the coefficients of the final model as odds ratios and provide confidence intervals for them:
# compute odds ratios
OR <- coef(m3) %>% exp
# compute confidence intervals (CI)
CI<-confint(m3)%>%exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8467110 0.4349166 1.6508083
## absences 1.0838772 1.0386751 1.1361111
## studytime 0.5768622 0.4211716 0.7758799
Odds ratios higher than 1 mean that the variable is positively associated with “success”, in this case being in the high-use category. Here, ‘absences’ is positively associated with high use and ‘studytime’ is negatively associated.
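The connection between a coefficient and its odds ratio is just exponentiation. For the absences coefficient from the model summary above:

```r
beta <- 0.08054   # absences coefficient from summary(m3) above
exp(beta)         # odds ratio per one extra absence, ~1.084 as in the OR table
exp(10 * beta)    # ten extra absences multiply the odds by this much
```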
From the confidence intervals I can check that 1 does not fall within them, because that would mean the variable has no effect on the outcome. I can also observe how wide the intervals are: a wide interval indicates low precision of the odds ratio, whereas a narrow interval indicates higher precision.
In my case studytime has a rather wide confidence interval, so its odds ratio is estimated with lower precision.
To validate my model I use it to make predictions and compare them with the true observed values.
# predict() the probability of high_use
probabilities <- predict(m3, type = "response")
# add the predicted probabilities to 'alc'
alc2 <- mutate(alc2, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
# see the last ten original classes, predicted probabilities, and class predictions
select(alc2, studytime, absences, high_use, probability, prediction) %>% tail(10)
## studytime absences high_use probability prediction
## 373 1 0 FALSE 0.3281537 FALSE
## 374 1 7 TRUE 0.4618903 FALSE
## 375 3 1 FALSE 0.1497827 FALSE
## 376 1 6 FALSE 0.4419431 FALSE
## 377 3 2 FALSE 0.1603317 FALSE
## 378 2 2 FALSE 0.2486902 FALSE
## 379 2 2 FALSE 0.2486902 FALSE
## 380 2 3 FALSE 0.2640419 FALSE
## 381 1 4 TRUE 0.4026660 FALSE
## 382 1 2 TRUE 0.3645990 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 256 12
## TRUE 96 18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.25130890 0.04712042 0.29842932
## Sum 0.92146597 0.07853403 1.00000000
Above I produced two confusion tables (in absolute numbers and as proportions), from which I can see how many, or what proportion, of the predictions were correct.
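The proportion of correct predictions can be computed straight from the first table:

```r
# confusion table from above: rows = observed, columns = predicted
tab <- matrix(c(256, 96, 12, 18), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
# correct predictions sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
accuracy   # about 0.717, i.e. roughly 72 % correct
```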
Then the same thing as a plot:
g <- ggplot(alc2, aes(x = probability, y = high_use,col=prediction))
# define the geom as points and draw the plot
g+geom_point()
In this plot I can see both the actual values and the predictions. I have quite a lot of predictions of FALSE even though the observations were actually TRUE. But that is not as bad as predicting TRUE when the reality was FALSE.
To quantify the goodness of my model I can compute the proportion of incorrect predictions.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions
lfm3<-loss_func(class = alc2$high_use, prob = alc2$probability)
lfm3
## [1] 0.2827225
The result is 28 %. I could also have computed that from the second cross table (25 + 3 = 28 %). In my opinion that sounds like a lot, but I guess it is still better than a simple guessing strategy. 72 % correct predictions sounds better :)
The problem with this kind of model validation is that the model is asked to predict the same values that are used to create the model. This seems like a circular argument.
Another way of doing model validation is cross-validation. Here a proportion of the data is set aside as testing data and the model is fitted with training data (the rest) only. The resulting model is then fed with testing data to make predictions. The goodness of the model is then quantified as the proportion of correct predictions. This procedure is repeated several times so that the whole dataset has been used as testing data. If the cross-validation is 5-fold it means that the data are split into 5 proportions of which 1/5 is used as testing dataset one at a time. Which data to select as testing data can be random or for example spatially defined.
In R cross-validation is easy to run several times, so I repeat the following R code a couple of times:
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2931937
All the results that I get are close to 29–30 %. This means that with cross-validation my model seems to perform worse than it did with the training-data loss function earlier (and the model used in DataCamp). I assume this is because cross-validation is more rigorous and therefore gives a more realistic estimate of the prediction accuracy. Without proper model validation the results might be over-optimistic.
Next I do a little exploration and compare the performance of logistic regression models with different sets of predictors. I build models using 30, 19, 10 and 5 variables, compute training and testing errors for all of them and compare the results.
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
#select 30 variables
chooseM1<-c("age","high_use","famrel","absences","studytime","higher","activities","nursery", "internet","romantic","G3","address","Pstatus","Medu","Fedu","Mjob","Fjob", "reason","guardian","traveltime","failures","schoolsup", "famsup","paid", "freetime", "goout","health","school","sex","famsize")
alcM1<-select(alc,one_of(chooseM1))
#29 explaining variables
M1<-glm(high_use ~ school+higher+sex+age+ address+ famsize+ Medu+ Fedu+ Mjob+ Fjob+ reason+ nursery+ internet+ guardian+ traveltime+ failures+ schoolsup+ famsup+ paid+ freetime+ goout+ health+absences+studytime+famrel+activities+romantic+G3+Pstatus, data = alcM1, family = "binomial")
summary(M1)
##
## Call:
## glm(formula = high_use ~ school + higher + sex + age + address +
## famsize + Medu + Fedu + Mjob + Fjob + reason + nursery +
## internet + guardian + traveltime + failures + schoolsup +
## famsup + paid + freetime + goout + health + absences + studytime +
## famrel + activities + romantic + G3 + Pstatus, family = "binomial",
## data = alcM1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8741 -0.6724 -0.3610 0.4884 2.8798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.15777 3.12129 -1.652 0.098443 .
## schoolMS -0.63603 0.54400 -1.169 0.242336
## higheryes -0.09098 0.69925 -0.130 0.896481
## sexM 0.78325 0.32823 2.386 0.017019 *
## age 0.11618 0.14915 0.779 0.435985
## addressU -0.79383 0.41437 -1.916 0.055398 .
## famsizeLE3 0.50910 0.32421 1.570 0.116356
## Medu 0.04330 0.22061 0.196 0.844378
## Fedu 0.17994 0.19563 0.920 0.357689
## Mjobhealth -1.07131 0.76315 -1.404 0.160382
## Mjobother -0.88901 0.49991 -1.778 0.075350 .
## Mjobservices -0.87464 0.57337 -1.525 0.127151
## Mjobteacher -0.28326 0.69357 -0.408 0.682977
## Fjobhealth 0.39818 1.19257 0.334 0.738465
## Fjobother 0.84816 0.86476 0.981 0.326685
## Fjobservices 1.42043 0.88485 1.605 0.108434
## Fjobteacher -0.44976 1.03350 -0.435 0.663434
## reasonhome 0.46349 0.38041 1.218 0.223067
## reasonother 1.49313 0.54221 2.754 0.005891 **
## reasonreputation -0.07508 0.41153 -0.182 0.855246
## nurseryyes -0.53587 0.36730 -1.459 0.144582
## internetyes 0.05745 0.45599 0.126 0.899736
## guardianmother -0.80193 0.36635 -2.189 0.028601 *
## guardianother -0.20393 0.79701 -0.256 0.798051
## traveltime 0.32510 0.23173 1.403 0.160644
## failures 0.22823 0.26686 0.855 0.392408
## schoolsupyes 0.07851 0.46427 0.169 0.865721
## famsupyes -0.08117 0.32407 -0.250 0.802224
## paidyes 0.59908 0.31690 1.890 0.058699 .
## freetime 0.30046 0.16728 1.796 0.072471 .
## goout 0.86758 0.15351 5.651 1.59e-08 ***
## health 0.18614 0.11321 1.644 0.100130
## absences 0.08983 0.02664 3.372 0.000747 ***
## studytime -0.31530 0.20585 -1.532 0.125599
## famrel -0.56897 0.16717 -3.404 0.000665 ***
## activitiesyes -0.57526 0.30855 -1.864 0.062265 .
## romanticyes -0.54641 0.33105 -1.651 0.098840 .
## G3 0.03375 0.05190 0.650 0.515575
## PstatusT -0.36418 0.53795 -0.677 0.498421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 322.06 on 343 degrees of freedom
## AIC: 400.06
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M1, type = "response")
# add the predicted probabilities to 'alc'
alcM1 <- mutate(alcM1, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM1 <- mutate(alcM1, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM1<-loss_func(class = alcM1$high_use, prob = alcM1$probability)
#cross-validation
cv <- cv.glm(data = alcM1, cost = loss_func, glmfit = M1, K = 10)
# average number of wrong predictions in the cross validation
cvM1<-cv$delta[1]
#####################################################################
#19 variables
chooseM2<-c("high_use","famrel","absences","studytime","activities","nursery", "romantic","address", "reason","guardian","traveltime","failures","paid", "freetime", "goout","health","school","sex","famsize")
alcM2<-select(alc,one_of(chooseM2))
#18 explaining variables
M2<-glm(high_use ~ school+sex+ address+ famsize+ reason+ nursery+ guardian+ traveltime+ failures+ paid+ freetime+ goout+health+absences+studytime+famrel+activities+romantic, data = alcM2, family = "binomial")
summary(M2)
##
## Call:
## glm(formula = high_use ~ school + sex + address + famsize + reason +
## nursery + guardian + traveltime + failures + paid + freetime +
## goout + health + absences + studytime + famrel + activities +
## romantic, family = "binomial", data = alcM2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9380 -0.6912 -0.3921 0.5668 2.7954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.24229 1.12611 -1.991 0.046461 *
## schoolMS -0.36846 0.50370 -0.732 0.464470
## sexM 0.83559 0.30472 2.742 0.006104 **
## addressU -0.86275 0.38198 -2.259 0.023905 *
## famsizeLE3 0.48723 0.30501 1.597 0.110173
## reasonhome 0.25059 0.36062 0.695 0.487133
## reasonother 1.11296 0.49712 2.239 0.025168 *
## reasonreputation -0.28950 0.37982 -0.762 0.445936
## nurseryyes -0.41960 0.33948 -1.236 0.216463
## guardianmother -0.72823 0.33009 -2.206 0.027373 *
## guardianother 0.10564 0.73524 0.144 0.885750
## traveltime 0.24550 0.21934 1.119 0.263024
## failures 0.17320 0.23209 0.746 0.455499
## paidyes 0.69857 0.29234 2.390 0.016870 *
## freetime 0.17239 0.15346 1.123 0.261274
## goout 0.84707 0.14158 5.983 2.19e-09 ***
## health 0.13123 0.10449 1.256 0.209140
## absences 0.09286 0.02425 3.830 0.000128 ***
## studytime -0.25968 0.19222 -1.351 0.176709
## famrel -0.48025 0.16126 -2.978 0.002900 **
## activitiesyes -0.43744 0.28545 -1.532 0.125416
## romanticyes -0.50198 0.30813 -1.629 0.103287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 338.38 on 360 degrees of freedom
## AIC: 382.38
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M2, type = "response")
# add the predicted probabilities to 'alc'
alcM2 <- mutate(alcM2, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM2 <- mutate(alcM2, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM2<-loss_func(class = alcM2$high_use, prob = alcM2$probability)
cv <- cv.glm(data = alcM2, cost = loss_func, glmfit = M2, K = 10)
# average number of wrong predictions in the cross validation
cvM2<-cv$delta[1]
##########################################################################
#10 variables
chooseM3<-c("high_use","famrel","absences","address", "reason","guardian","paid", "goout","sex","famsize")
alcM3<-select(alc,one_of(chooseM3))
#9 explaining variables
M3<-glm(high_use ~ sex+ address+ famsize+ reason+ guardian+ paid+ goout+absences+famrel, data = alcM3, family = "binomial")
summary(M3)
##
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian +
## paid + goout + absences + famrel, family = "binomial", data = alcM3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9617 -0.7446 -0.4527 0.6513 2.8447
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.13120 0.78716 -2.707 0.006780 **
## sexM 0.99325 0.27671 3.589 0.000331 ***
## addressU -0.91297 0.33198 -2.750 0.005959 **
## famsizeLE3 0.44078 0.29197 1.510 0.131128
## reasonhome 0.15096 0.34314 0.440 0.659987
## reasonother 0.89873 0.46825 1.919 0.054940 .
## reasonreputation -0.53510 0.35986 -1.487 0.137017
## guardianmother -0.70149 0.31737 -2.210 0.027085 *
## guardianother 0.33524 0.69442 0.483 0.629265
## paidyes 0.52463 0.27638 1.898 0.057669 .
## goout 0.86127 0.13257 6.497 8.21e-11 ***
## absences 0.09035 0.02330 3.877 0.000106 ***
## famrel -0.43437 0.14964 -2.903 0.003699 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 354.60 on 369 degrees of freedom
## AIC: 380.6
##
## Number of Fisher Scoring iterations: 5
# predict() the probability of high_use
probabilities <- predict(M3, type = "response")
# add the predicted probabilities to 'alc'
alcM3 <- mutate(alcM3, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM3 <- mutate(alcM3, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM3<-loss_func(class = alcM3$high_use, prob = alcM3$probability)
cv <- cv.glm(data = alcM3, cost = loss_func, glmfit = M3, K = 10)
# average number of wrong predictions in the cross validation
cvM3<-cv$delta[1]
##########################################################################
#5 variables
chooseM4<-c("high_use","famrel","absences", "goout","sex")
alcM4<-select(alc,one_of(chooseM4))
#4 explaining variables
M4<-glm(high_use ~ sex+ goout+absences+famrel, data = alcM4, family = "binomial")
summary(M4)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences + famrel, family = "binomial",
## data = alcM4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7151 -0.7820 -0.5137 0.7537 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
## sexM 1.01234 0.25895 3.909 9.25e-05 ***
## goout 0.76761 0.12316 6.232 4.59e-10 ***
## absences 0.08168 0.02200 3.713 0.000205 ***
## famrel -0.39378 0.14035 -2.806 0.005020 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.81 on 377 degrees of freedom
## AIC: 389.81
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
probabilities <- predict(M4, type = "response")
# add the predicted probabilities to 'alc'
alcM4 <- mutate(alcM4, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcM4 <- mutate(alcM4, prediction = probability>0.5)
# call loss_func to compute the average number of wrong predictions
lfM4<-loss_func(class = alcM4$high_use, prob = alcM4$probability)
cv <- cv.glm(data = alcM4, cost = loss_func, glmfit = M4, K = 10)
# average number of wrong predictions in the cross validation
cvM4<-cv$delta[1]
#create a data frame containing the number of variables and training and testing errors
Nvariables<-c(5,10,19,30)
errors<-as.data.frame(Nvariables)
errors$train=NA
errors[1, 2] = lfM4
errors[2, 2] = lfM3
errors[3, 2] = lfM2
errors[4, 2] = lfM1
errors$test=NA
errors[1, 3] = cvM4
errors[2, 3] = cvM3
errors[3, 3] = cvM2
errors[4, 3] = cvM1
#plot the results
dfplot <- errors %>% gather(key, value, -Nvariables)
ggplot(dfplot, mapping = aes(x = Nvariables, y = value, color = key) ) + geom_line()
It seems that reducing the number of explanatory variables from 30 to 5 has little effect on the training error, whereas it clearly reduces the testing error: models with fewer variables seem to give more accurate predictions.
I’m not sure exactly what to conclude from this result, because as I reduced the number of explanatory variables I dropped them based on their statistical insignificance, not at random.
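For readability, here is a minimal version of the loss_func called throughout the comparisons above (it was defined earlier in the document, per the course exercises): it computes the mean misclassification rate, i.e. the share of cases where the predicted probability falls on the wrong side of 0.5. This is my own restatement of what I understand the function to do.

```r
# mean prediction error: the proportion of cases where the predicted
# probability is on the wrong side of 0.5 relative to the true class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# toy check: predictions 2 and 4 are on the wrong side of 0.5,
# so half of the four predictions are wrong
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.1, 0.7))
```

The same function is handed to cv.glm() as the cost argument, so the training and cross-validated errors are measured on the same scale.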
This week I learned about clustering and classification: how to cluster observations, how to study which factors drive or justify the clustering, how to choose an appropriate number of clusters, and so on.
This week’s data comes from an R package called MASS:
# access the MASS package
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset consists of 506 observations of 14 variables, all numerical. One variable (‘chas’) is a 1/0, presence/absence dummy variable. The variables describe housing values in the suburbs of Boston, together with factors measured in the suburbs that are thought to be related to housing values: for example, the crime rate, access to the Charles River, nitrogen oxides concentration, average number of rooms per dwelling, distances to five Boston employment centres, accessibility to radial highways, proportion of black residents by town and the median value of owner-occupied homes. The full details can be found here.
Let’s explore graphically the distributions and relations of the data:
pairs(Boston)
This plot is difficult to read; I’ll figure out later how to improve the quality of the output. From the summary table, however, I’m able to explore the variation and distributions of the variables.
Here are two dotplots of variables ‘crim’ (per capita crime rate by town) and ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft.) which show that these two variables are not very evenly distributed:
dotchart(Boston$crim)
dotchart(Boston$zn)
Some variables seem correlated. Here’s a correlation matrix of the variables:
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(magrittr)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>%round(digits=2)
# print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
The above matrix is not very readable as it extends into two separate parts. Let’s present the correlations in a nicer way.
# visualize the correlation matrix
corrplot(cor_matrix, method="circle",type="upper",cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
This plot is easier to read. The bigger the circle, the stronger the correlation between the variables. Red indicates negative correlation and blue indicates positive correlation.
Some of the variables have very high values and wide distributions. We want to scale all variables, because methods that sum or average variables are distorted when those variables are on different scales. Scaling can be applied to every variable in the dataset, as they are all numerical.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
Now all the variables have a mean of zero and a standard deviation of one, so their distributions are directly comparable.
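To convince myself of what scale() actually does, here is a quick sanity check on one variable: it subtracts the column mean and divides by the column standard deviation. This is my own verification, not part of the course exercise.

```r
# sanity check: scale() is just (x - mean(x)) / sd(x), column by column
library(MASS)                              # for the Boston data

x      <- Boston$crim
manual <- (x - mean(x)) / sd(x)            # standardize by hand
auto   <- as.numeric(scale(Boston)[, "crim"])

all.equal(auto, manual)                    # the two agree
```

This also explains why every column of the scaled data has mean zero: the mean has been subtracted out.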
Next I create a categorical variable of the crime rate in the Boston dataset. I use quantiles as the break points. I drop the old crime rate variable from the dataset.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
For later model evaluation purposes I divide the dataset into training and testing datasets, so that 80% of the data belongs to the train set:
##dividing the data into training and testing sets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
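One thing worth noting: sample() draws a different random 80 % subset every time the document is knitted, so downstream results change between runs. A small sketch of making the split reproducible with a seed (the seed value 2018 is my own arbitrary choice, not from the course):

```r
# fix the random number generator so the split is identical on every knit
set.seed(2018)
n   <- 506                           # nrow(boston_scaled) for the Boston data
ind <- sample(n, size = floor(n * 0.8))

length(ind)                          # 404 training rows (80 % of 506, rounded down)
anyDuplicated(ind)                   # 0: sampling is without replacement
```

With the seed fixed, `boston_scaled[ind,]` and `boston_scaled[-ind,]` would give the same train/test division on every run.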
Next I want to know which variables might explain the target variable crime rate. I do a linear discriminant analysis with the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables:
# linear discriminant analysis
lda.fit <- lda(crime~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2599010 0.2574257 0.2301980
##
## Group means:
## zn indus chas nox rm
## low 0.98949884 -0.9030454 -0.079333958 -0.8762533 0.4580255
## med_low -0.08113245 -0.3066922 -0.009855719 -0.5847777 -0.1216352
## med_high -0.39488943 0.2350252 0.295521928 0.3961168 0.0294096
## high -0.48724019 1.0149946 -0.060657012 0.9962637 -0.3890523
## age dis rad tax ptratio
## low -0.9105094 0.8782527 -0.6722355 -0.7354632 -0.41691230
## med_low -0.2838192 0.3585119 -0.5498288 -0.4764756 -0.07008217
## med_high 0.3788274 -0.3736974 -0.3976990 -0.2749146 -0.19220369
## high 0.7885245 -0.8453531 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3741162 -0.7779813 0.52377288
## med_low 0.3083986 -0.1250603 -0.01019437
## med_high 0.1165666 0.0477216 0.12094959
## high -0.8988923 0.8471605 -0.69043588
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.083997038 0.7173626033 -0.89323527
## indus 0.033505611 -0.2257427905 0.21042219
## chas -0.064677615 -0.0789752501 0.03358591
## nox 0.255917654 -0.8223216412 -1.54772191
## rm -0.101041075 0.0005479474 -0.12551422
## age 0.301129407 -0.2488880179 0.15659154
## dis -0.106519457 -0.2995733219 0.20894309
## rad 3.190005830 0.9575551654 -0.02215920
## tax -0.003077552 -0.0628363337 0.57200802
## ptratio 0.133296450 -0.0462622218 -0.36172841
## black -0.180035400 -0.0310410121 0.07040604
## lstat 0.119194098 -0.2274669768 0.33977727
## medv 0.168574061 -0.4483971838 -0.23260203
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9431 0.0427 0.0142
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)
The variable ‘rad’ looks like a strong classifying factor. ‘zn’ and ‘nox’ also separate the observations.
Next I want to use the observations in the test set to predict crime classes. I do this because I want to estimate the “goodness” of my model by comparing predictions to observed “real” data.
For prediction I use the LDA model on the test data. For comparison I tabulate the results with the crime categories from the test set:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 10 2 0
## med_low 2 13 6 0
## med_high 0 5 17 0
## high 0 0 1 33
Because the train/test division is random, I ran the split and prediction twice. On the first round I got a fairly poor result, with more than half of the med_high cases predicted incorrectly. On the second round the results look better (those are the results shown here). Some classes are still incorrectly predicted, but at least most of the predictions are correct.
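The overall accuracy can be read straight off the cross-tabulation: the diagonal holds the correctly predicted cases. A quick computation using the counts shown above:

```r
# confusion matrix copied from the output above (rows = correct, cols = predicted)
tab <- matrix(c(13,  2,  0,  0,    # predicted low
                10, 13,  5,  0,    # predicted med_low
                 2,  6, 17,  1,    # predicted med_high
                 0,  0,  0, 33),   # predicted high
              nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(tab)) / sum(tab)
accuracy                           # 76 / 102, about 0.75
```

So roughly three quarters of the test-set observations are classified correctly, with ‘high’ being by far the easiest class to predict.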
Next I study the Boston data without any classifications and try to cluster the observations into groups; maybe they form clusters according to the suburbs. I run the k-means algorithm on the dataset, investigate the optimal number of clusters, and then run the algorithm again.
First I reload the Boston dataset and standardize it. Then I calculate the Euclidean distances between the observations and present a summary of the distances:
#standardize the data set
boston_scaled2 <- scale(Boston)
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2<-as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Next I run the k-means clustering with 3 centers.
# k-means clustering
km <-kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2[9:14], col = km$cluster)
I zoomed in to various parts of the plot and found that, for the variable ‘tax’, the observations are divided into clusters so that at least the black observations clearly form their own group.
I also explored the clustering with 5 centers. The grouping seemed even more arbitrary.
Now, since I’m not sure about the best number of clusters, I compute the total within-cluster sum of squares (WCSS) and see how it behaves as the number of clusters changes:
library(ggplot2)
# set values
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS drops dramatically up to two clusters and only slowly after that, so two seems to be the optimal number of clusters for this dataset.
I run the clustering again with 2 centers:
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster)
Now the clustering seems better, at least for some variable pairs. But in my opinion, having only two groups doesn’t tell much. Maybe it suggests that the residents of Boston are divided into two groups, the wealthy and the poor?
Next I perform the LDA again on the Boston dataset, this time with the (3) k-means clusters as the target variable. By visualizing the results with a biplot, I can interpret which variables influence the clustering.
boston_scaled3<-boston_scaled2
# k-means clustering
km <-kmeans(boston_scaled3, centers = 3)
klusteri<-km$cluster
class(klusteri)
## [1] "integer"
boston_scaled3<-cbind(boston_scaled3,klusteri)
summary(boston_scaled3)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv klusteri
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:1.000
## Median : 0.3808 Median :-0.1811 Median :-0.1449 Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :1.972
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:3.000
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865 Max. :3.000
# linear discriminant analysis
lda.fit2 <- lda(klusteri~., data = boston_scaled3)
# print the lda.fit object
lda.fit2
## Call:
## lda(klusteri ~ ., data = boston_scaled3)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3003953 0.4268775 0.2727273
##
## Group means:
## crim zn indus chas nox rm
## 1 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456
## 3 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125
## age dis rad tax ptratio black
## 1 0.7828949 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974
## 2 0.1673019 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397
## 3 -1.1241828 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235
## lstat medv
## 1 0.90799414 -0.69550394
## 2 -0.05818052 -0.04811607
## 3 -0.90904433 0.84137443
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.043702606 0.16161136
## zn 0.049248495 0.76920932
## indus -0.331498698 0.02870425
## chas -0.012406954 -0.11314905
## nox -0.721972554 0.40566595
## rm 0.174541989 0.41632858
## age 0.006221178 -0.88117192
## dis 0.043869924 0.36910493
## rad -1.256861546 0.47665247
## tax -0.992855786 0.46457291
## ptratio -0.092336951 -0.01003010
## black 0.073915653 -0.03513128
## lstat -0.372145848 0.38403679
## medv -0.058153798 0.49571753
##
## Proportion of trace:
## LD1 LD2
## 0.8785 0.1215
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled3$klusteri)
# plot the lda results
plot(lda.fit2, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit2, myscale = 1)
From these results I would interpret that the variable ‘rad’ (index of accessibility to radial highways) is the strongest linear separator in this dataset, although many other variables follow not far behind.
Next I’ll draw some 3D plots of the training data. NOTE! You might have to click around to see the figure.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Set the color to be the crime classes of the train set. Draw another 3D plot
# where the color is defined by the clusters of the k-means. How do the plots differ?
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color=train$crime)
I stop the exercise here because I’m having trouble understanding the instructions. I’m able to draw these two 3D plots, and crime seems to be a strong separator in the dataset. The last plot should demonstrate the division by clusters. However, I’m no longer sure whether I should run the k-means clustering again on the training data and then change the plotting code, or whether I could do it just by modifying the color argument. I leave it here.
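For completeness, here is my best guess at the missing step, untested against the course’s model answer: run k-means on the training-set predictors and colour the same LDA projection by the resulting clusters instead of the crime classes. The block below rebuilds the scaled data, the split, and the LDA projection so it is self-contained; the seed and the choice of four clusters (to mirror the four crime classes) are my own assumptions.

```r
library(MASS)                                    # Boston dataset, lda()

# rebuild the scaled data with the categorical crime variable, as above
boston_scaled <- as.data.frame(scale(Boston))
bins  <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim  <- NULL
boston_scaled$crime <- crime

# 80 % training split (seed is an arbitrary choice of mine)
set.seed(123)
ind   <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# LDA projection of the training predictors, as in the chunks above
lda.fit          <- lda(crime ~ ., data = train)
model_predictors <- train[, names(train) != "crime"]
matrix_product   <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same training predictors; its cluster labels become the colours
km_train <- kmeans(model_predictors, centers = 4)

# same 3D plot as above, but color = k-means cluster instead of crime class
if (requireNamespace("plotly", quietly = TRUE)) {
  plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                  z = matrix_product$LD3, type = 'scatter3d', mode = 'markers',
                  color = as.factor(km_train$cluster))
}
```

If the groups in this plot roughly match the crime-coloured plot, the k-means clustering has rediscovered a structure similar to the crime classes; if not, the clusters are driven by other variables.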